Lessons Learned in Deploying the World’s Largest Scale Lustre File System
Abstract
The Spider system at the Oak Ridge National Laboratory’s Leadership Computing Facility (OLCF) is the world’s largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF’s diverse computational environment, the project had a number of ambitious goals. To support the workloads of the OLCF’s diverse computational platforms, the aggregate performance and storage capacity of Spider exceed those of our previously deployed systems by factors of 6x (240 GB/sec) and 17x (10 petabytes), respectively. Furthermore, Spider supports over 26,000 clients concurrently accessing the file system, which exceeds our previously deployed systems by nearly 4x. In addition to these scalability challenges, moving to a center-wide shared file system required dramatically improved resiliency and fault-tolerance mechanisms. This paper details our efforts in designing, deploying, and operating Spider. Through a phased approach of research and development, prototyping, deployment, and transition to operations, this work has resulted in a number of insights into large-scale parallel file system architectures, from both the design and the operational perspectives. We present in this paper our solutions to issues such as network congestion, performance baselining and evaluation, file system journaling overheads, and high availability in a system with tens of thousands of components. We also discuss areas of continued challenges, such as stressed metadata performance and the need for file system quality of service, along with our efforts to address them. Finally, operational aspects of managing a system of this scale are discussed along with real-world data and observations.
Similar resources
Trillion Particles, 120,000 cores and 350 TBs: Lessons Learned from a Hero I/O Run on Hopper
Modern petascale applications can present a variety of configuration, runtime, and data management challenges when run at scale. In this paper, we describe our experiences in running VPIC, a large-scale plasma physics simulation, on the NERSC production Cray XE6 system Hopper. The simulation ran on 120,000 cores using ∼80% of computing resources, 90% of the available memory on each node and 50%...
A Next-Generation Parallel File System Environment for the OLCF
When deployed in 2008/2009 the Spider system at the Oak Ridge National Laboratory’s Leadership Computing Facility (OLCF) was the world’s largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF’s diverse computational environment, Spider has since become a blueprint for shared Lustre env...
Efficient Object Storage Journaling in a Distributed Parallel File System
Journaling is a widely used technique to increase file system robustness against metadata and/or data corruptions. While the overhead of journaling can be masked by the page cache for small-scale, local file systems, we found that Lustre’s use of journaling for the object store significantly impacted the overall performance of our large-scale centerwide parallel file system. By requiring that e...
Distributed File Recovery on the Lustre Distributed File System
With the advancement of cloud-computing technologies and the growth in distributed software applications, a great deal of research has focused on the concepts and implementations of distributed file systems to support these applications. Since its inception in 1999 by Peter Braam at Carnegie Mellon University, the Lustre distributed file system has gained both the technical, as well as...
Regionalization of the Iowa State University Extension System: Lessons Learned by Key Administrators
The cyclical economic downturn in the United States has forced many Extension administrators to rethink and adjust services and programming. The Cooperative Extension System (CES), the organization primarily responsible for governmental Extension work in the United States, at Iowa State University responded to this economic downturn by restructuring its organization from county based to a regio...
Publication date: 2010